Performance Evaluation of Numeric Compute Kernels on nVIDIA GPUs
Authors
Abstract
Graphics processing units provide an astonishing number of floating point operations per second and deliver memory bandwidths an order of magnitude greater than those of common general purpose central processing units. With the introduction of the Compute Unified Device Architecture, nVIDIA took a first step towards easing access to the vast computational resources of graphics processing units. The aim of this thesis is to shed light on the general hardware and software structures of this promising architecture. In contrast to well established high performance architectures, which offer moderate on-chip parallelism, graphics processing units use massive parallelism at the thread level. Thus, parallelization approaches are required which exploit a substantially finer level of parallelism than OpenMP parallelization on standard multi-core and multi-socket servers. Basic benchmark kernels as well as libraries are investigated to demonstrate the basic parallelization approaches and the potential regarding peak performance and main memory bandwidth. A kernel from a computational fluid dynamics solver based on the lattice Boltzmann method is introduced and evaluated in terms of implementation issues and performance. Substantial low-level hand optimization is required to exploit the full capabilities of graphics processing units, even for this simple computational fluid dynamics kernel. For selected verification cases, the optimized kernel outperforms a standard two-socket server at single-precision accuracy by almost one order of magnitude.
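The fine-grained thread-level parallelism described above can be illustrated with a minimal sketch of the kind of basic benchmark kernel the thesis evaluates: a single-precision vector triad, where each GPU thread processes exactly one array element. This is a hypothetical illustration, not code from the thesis; the function names and the use of managed memory (a convenience added in later CUDA releases) are assumptions.

```cuda
// Sketch of a single-precision "vector triad" bandwidth benchmark,
// a[i] = b[i] + s * c[i], with one CUDA thread per array element.
// This illustrates the massive, fine-grained parallelism contrasted
// in the abstract with coarse OpenMP parallelization on CPU servers.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void triad(float *a, const float *b, const float *c,
                      float s, int n)
{
    // Global thread index: block offset plus thread offset within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard against overshoot in the last block
        a[i] = b[i] + s * c[i];
}

int main(void)
{
    const int n = 1 << 20;          // 1M elements
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { b[i] = 1.0f; c[i] = 2.0f; }

    // Launch roughly a million threads: 256 per block, enough blocks
    // to cover all n elements -- far finer-grained than one OpenMP
    // thread per CPU core.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    triad<<<blocks, threads>>>(a, b, c, 3.0f, n);
    cudaDeviceSynchronize();

    printf("a[0] = %f\n", a[0]);    // 1.0 + 3.0 * 2.0 = 7.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Because each thread touches only a few words of memory, a kernel like this is bound by main memory bandwidth rather than arithmetic throughput, which is why such triads are standard probes for the bandwidth figures the thesis reports.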
Similar works
FATSEA – An Architectural Simulator for General Purpose Computing on GPUs
We present FATSEA, a functional and performance evaluation simulator written in C++ to handle kernels written in the CUDA programming language, aimed at GPGPU computing. FATSEA takes Parallel Thread eXecution (PTX) code as input, which is a device-independent code format generated by the Nvidia CUDA compiler, to validate results and estimate performance on Nvidia platforms. This paper shows ...
Performance Analysis of Application Kernels in Multi/Many-Core Architectures
In recent years, advances in technology and computing have led to huge amounts of data being generated. Thus, High-Performance Computing (HPC) plays an ever-growing role in processing these large datasets in a timely fashion. Our analysis consists of a few important throughput-computing application kernels whose high degree of parallelism makes them excellent candidates for evaluation on high-end mu...
A Compiler Framework for Optimization of Affine Loop Nests for General Purpose Computations on GPUs
GPUs are a class of specialized parallel architectures with tremendous computational power. The new Compute Unified Device Architecture (CUDA) programming model from NVIDIA facilitates programming of general purpose applications on NVIDIA GPUs. However, there are various performance-influencing factors specific to GPU architectures that need to be accurately characterized to effectively utilize...
Evaluation of Directive-based Performance Portable Programming Models
We present an extended exploration of the performance portability of directives provided by OpenMP 4 and OpenACC to program various types of node architectures with attached accelerators, both self-hosted multicore and offload multicore/GPU. Our goal is to examine how successful OpenACC and the newer offload features of OpenMP 4.5 are for moving codes between architectures, and we document how ...
Iterative statistical kernels on contemporary GPUs
We present a study of three important kernels that occur frequently in iterative statistical applications: Multi-Dimensional Scaling (MDS), PageRank, and K-Means. We implemented each kernel using OpenCL and evaluated their performance on NVIDIA Tesla and NVIDIA Fermi GPGPU cards using dedicated hardware, and in the case of Fermi, also on the Amazon EC2 cloud-computing environment. By examining ...